The influenza A virus (IAV) causes a respiratory illness that presents with fever, cough, muscle and joint pains, headache, sore throat, and runny nose. IAV is a member of the Orthomyxoviridae family and contains 8 negative-sense single stranded RNA segments {Urbaniak:2014wj}{Nelson:2015cg}{Webster:2014jt}. The major viral antigens of IAV are the surface glycoproteins hemagglutinin (HA) and neuraminidase (NA), these proteins are used to subtype the IAV into sixteen HA subtypes and nine NA subtypes {Webster:2014jt}. Influenza A virus has numerous host including water fowls, pigs, bats, and humans {Urbaniak:2014wj}{Nelson:2015cg}{Webster:2014jt}. Various mechanisms can change HA and NA in a specific IAV strain causing vaccines deficiencies {Goka:2014cz}. Two of these mechanisms are antigenic shift and antigenic drift. Due to the lack of proofreading mechanisms in IAV RNA replication, mutations can be easily introduced {Webster:2014jt}. These mutation can cause antigenic drift by changing base pairs in HA and NA making vaccines ineffective. Another way in which an IAV strain can change is by antigenic shift. Antigenic shift occurs when two or more IAV infect a host, different viral strains exchange genetic segments, and viral reassortment occurs {Urbaniak:2014wj}{Nelson:2015cg}{Webster:2014jt}. Pandemic strains can occur after antigenic shift improves an IAV strain virulence, evasion of the host's immune system, or introduces a new set of glycoproteins that the majority of the population have not been previously exposed to {Urbaniak:2014wj}{Nelson:2015cg}{Webster:2014jt}. Pigs are excellent host for antigenic shift to occur due to there expression of both sialic acid receptors for avian strains and mammalian strains {Nelson:2015cg}. With this information in mind, we want to determine if the host plays a role in the rate of mutation in IAV. To determine if the mutation rate was dependent on the host, we compared H1N1 strains collected from different years in both human and pig.
The sequence for H1N1 influenza A virus found in humans or pigs for the years of 1935, 1978, 2009, and 2014 was downloaded from the NCBI Influenza virus resource (http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html). The search criteria was the following; type A influenza virus, host either human or swine, northern temperate region, subtype H1 N1, and the aforementioned years.
In [ ]:
from Bio import SeqIO, AlignIO, Phylo
from Bio.Align.Applications import ClustalwCommandline
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Graphics import GenomeDiagram
from reportlab.lib import colors
from reportlab.lib.units import cm
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import heatmap
import multigroup_barchart
import os, sys
%matplotlib inline
#Make a variable point to where ClustalW2 is on your computer
clustalw_exe = r"C:\Program Files (x86)\ClustalW2\clustalw2.exe"
#Make sure you are in the same directory as the sequence FASTA files
In [ ]:
years = [1935,1978,2009,2014]
genomes = {}
#Create dict w/ each complete genome as a list w/ key = h or s followed by year
for year in range(len(years)):
genomes['h%i'%(years[year])] = list(SeqIO.parse('human_%i_FASTA.fa'%(years[year]), 'fasta'))
genomes['s%i'%(years[year])] = list(SeqIO.parse('swine_%i_FASTA.fa'%(years[year]), 'fasta'))
In [ ]:
#Concatenate all segments from each genome into their own dict entry, key = h/sYEAR_all
for year in range(len(years)):
genomes['h%i_all'%(years[year])] = genomes['h%i'%(years[year])][0]
genomes['s%i_all'%(years[year])] = genomes['s%i'%(years[year])][0]
for seg in range(1,8):
genomes['h%i_all'%(years[year])] = genomes['h%i_all'%(years[year])] + genomes['h%i'%(years[year])][seg]
genomes['s%i_all'%(years[year])] = genomes['s%i_all'%(years[year])] + genomes['s%i'%(years[year])][seg]
genomes['h%i_all'%(years[year])].id = 'h' + str(years[year])
genomes['s%i_all'%(years[year])].id = 's' + str(years[year])
In [ ]:
#Create lists of SeqRecord objects for all human H1N1 genomes, all swine, and both combined
all_human = [genomes['h1935_all']]
all_swine = [genomes['s1935_all']]
all_seq = [genomes['h1935_all'],genomes['s1935_all']]
for year in range(1,4):
all_human.append(genomes['h%i_all'%(years[year])])
all_seq.append(genomes['h%i_all'%(years[year])])
all_swine.append(genomes['s%i_all'%(years[year])])
all_seq.append(genomes['s%i_all'%(years[year])])
#Write these to FASTA files, so ClustalW can align them
SeqIO.write(all_human,'all_human.fa','fasta');
SeqIO.write(all_swine,'all_swine.fa','fasta');
SeqIO.write(all_seq,'all_seq.fa','fasta');
In [ ]:
#Align all human sequences
cline_human = ClustalwCommandline(clustalw_exe,infile='all_human.fa')
stdout, stderr = cline_human()
In [ ]:
#Align all swine sequences
cline_swine = ClustalwCommandline(clustalw_exe,infile='all_swine.fa')
stdout, stderr = cline_swine()
In [ ]:
#Align all sequences
cline_all = ClustalwCommandline(clustalw_exe,infile='all_seq.fa')
stdout, stderr = cline_all()
In [ ]:
#Import the aligned sequences
all_aln = AlignIO.read('all_seq.aln','clustal')
all_aln.sort()
print all_aln
In [ ]:
#collect ids and separate based on host (letter prefex preceding the year)
all_ids = [alignment.id for alignment in all_aln]
h_ids = all_ids[:4]
s_ids = all_ids[4:]
print all_ids,'\n', h_ids, s_ids
The cell below contains all the functions used to make the track diagrams seen in the Results section.
In [ ]:
def get_diff_locations(ref,compare):
"""
Takes 2 Biopython Seq objects of equal size (aligned).
Returns a list of locations at which base in compare sequence != base in ref sequence.
"""
if len(ref) != len(compare):
raise ValueError('Seqs are not the same length!')
diff_locations = []
for i in range(len(ref)):
if ref[i] != compare[i]:
diff_locations.append(i)
return diff_locations
def plot_tracks(track_diagram,path):
'''
Save track_diagram as pngs, then load them as images and display with matplotlib.
'''
#Write each diagram to a png, then read it back in using Matplotlib
track_diagram.draw(format='linear', pagesize=(14*cm,7*cm), fragments=1,
start=0, end=13378)
#get extension from tracks path
root, ext = os.path.splitext(path)
if not ext:
ext = '.png'
tracks_path += ext
track_diagram.write(path, ext.strip('.'), dpi=600)
tracks_im = mpimg.imread(path)
#Plot each set of tracks
fig = plt.figure(figsize=(14,7),dpi=600)
ax = fig.add_axes([0.025,0.025,0.95,0.95],frameon=False)
plt.axis('off')
plt.imshow(tracks_im)
def setup_track_diagram(diagram_name, ids, length):
'''
Inits empty track diagram (with diagram_name as title), tracks, and feature sets for all ids.
Adds back ground color to each track (for contrast).
Returns tracks, features, and feature_sets.
'''
#Create Genome Diagram for human
track_diagram = GenomeDiagram.Diagram(diagram_name+' H1N1 Tracks Plot', tracklines = 0)
track = {}
feature_set = {}
#Generate tracks and feature sets for each year
track_count = 1
for i in ids:
track[i] = track_diagram.new_track(track_count, name=i, greytrack=False)#greytrack=True,
#greytrack_labels=1, greytrack_fontcolor = colors.cornsilk,
#greytrack_fontsize=15)
feature_set[i] = track[i].new_set(name=i)
#add contrast background
feature_set[i].add_feature(SeqFeature(FeatureLocation(0,length)),
label=True, label_angle=0, color=colors.cadetblue)
track_count += 1
return track_diagram, track, feature_set
def add_features_year_compare(feature_set, diffs, hostspecies, base_year):
for k, features in feature_set.items():
year = k
if base_year == k:
#if the base year...
label = '%s: %s (reference)'%(hostspecies,year)
else:
#name bg feature (which labels the track)
label ='%s: %s vs. %s --- %i differences'%(hostspecies, year, base_year, len(diffs[year]))
# For each difference recorded in diffs, add as a feature to feature_set
for i in diffs[k]:
feature = SeqFeature(FeatureLocation(i,i))
features.add_feature(feature, name=label, color=colors.aqua)
# set the name of the first feature (the background feature) to label
# this is stupid, but I'm not sure how to do it better
features.get_features()[0].name = label
def compare_strains_by_host(diffs):
#Create Genome Diagram to compare human and swine in each year
diagram = GenomeDiagram.Diagram('Human vs. Swine H1N1 Tracks Plot')
tracks = {}
feature_set = {}
for n,(k,v) in enumerate(diffs.iteritems(),1):
tracks[k] = diagram.new_track(n, greytrack=False)
feature_set[k] = tracks[k].new_set()
return diagram, tracks, feature_set
def add_features_host_compare(feature_set, diffs):
'''
Add features to feature_sets from diffs.
'''
for k,v in feature_set.iteritems():
feature_set[k].add_feature(SeqFeature(FeatureLocation(0,13378)),
name = 'Human vs. Swine: %s --- %i differences'%(k,len(diffs[k])),
label=True, label_angle=0, color=colors.cadetblue)
for i in diffs[k]:
feature = SeqFeature(FeatureLocation(i,i))
feature_set[k].add_feature(feature, color=colors.aqua)
In [ ]:
# make comparisons in prep for track diagrams,
# locate differences between alignments
h_diffs = {} #for comparing between human strains across diff. years
s_diffs = {} #for comparing between swine strains across diff. years
h_s_diffs = {} #for comparing between host strains in the same year
for i in range(len(years)):
year_i = years[i]
h_diffs[year_i]={}
s_diffs[year_i]={}
for ii in range(len(years)):
year_ii = years[ii]
if i == ii:
#host comparisons
h_s_diffs[year_i] = get_diff_locations(all_aln[i].seq,all_aln[ii+4].seq)
else:
#year_i is the base_year (the reference year)
h_diffs[year_i][year_ii]=get_diff_locations(all_aln[i].seq,all_aln[ii].seq)
s_diffs[year_i][year_ii]=get_diff_locations(all_aln[i+4].seq,all_aln[ii+4].seq)
Make cross-year comparisons between strains with the same host.
In [ ]:
# make by-year (same host) comparisons
h_diagrams = {}
s_diagrams = {}
for base_year in h_diffs:
#human
h_diagram, h_features, h_feature_set = setup_track_diagram(diagram_name='Human',
ids = h_diffs[base_year].keys(),
length=all_aln.get_alignment_length())
add_features_year_compare(feature_set = h_feature_set, diffs = h_diffs[base_year],
hostspecies = 'Human',base_year = base_year)
h_diagrams[base_year] = h_diagram
#swine
s_diagram, s_features, s_feature_set = setup_track_diagram(diagram_name='Swine',
ids = s_diffs[base_year].keys(),
length=all_aln.get_alignment_length())
add_features_year_compare(feature_set = s_feature_set, diffs = s_diffs[base_year],
hostspecies = 'Swine',base_year = base_year)
s_diagrams[base_year] = s_diagram
Make cross-host comparisons between strains in the same year.
In [ ]:
# make cross-host (same year) comparisons
diagram, tracks, feature_set = compare_strains_by_host(h_s_diffs)
add_features_host_compare(feature_set, h_s_diffs)
Using the 1935 sequence of H1N1 to compare any changes in the genome from the samples obtain in 1978, 2009, and 2014 from either human or pig host, we can see a clear increase in the number of sequence changes as time progresses. The sequence from 2014 has more differences than the sequence from 1978. Interestingly when comparing the 1978 sequence to the 1935 reference, the H1N1 strain collected in pig had more differences, 1206, than the same year sequence found in human, 850. Furthermore the 2009 sequence show a greater number of differences in sample found in human, 2274, than the sample found in pig, 1976. Lastly, both pig and human H1N1 strains had a similar number of changes in 2014, 2374 for human and 2247 for pig. These data shows that the H1N1 genome found in pig change more rapidly from 1978 to 2009, gaining 1846 differences. The highest number of changes happened between 1978 and 2009, the H1N1 virus found in human added 2323 changes.
In the following track diagrams, brighter colors indicate differences from the reference strain.
In [ ]:
base_year = 1935
plot_tracks(h_diagrams[base_year],'Humanstrain_compared_against_year_%d.png'%base_year)
plot_tracks(s_diagrams[base_year],'Swinestrain_compared_against_year_%d.png'%base_year)
In [ ]:
base_year = 1978
plot_tracks(h_diagrams[base_year],'Humanstrain_compared_against_year_%d.png'%base_year)
plot_tracks(s_diagrams[base_year],'Swinestrain_compared_against_year_%d.png'%base_year)
In [ ]:
base_year = 2009
plot_tracks(h_diagrams[base_year],'Humanstrain_compared_against_year_%d.png'%base_year)
plot_tracks(s_diagrams[base_year],'Swinestrain_compared_against_year_%d.png'%base_year)
In [ ]:
base_year = 2014
plot_tracks(h_diagrams[base_year],'Humanstrain_compared_against_year_%d.png'%base_year)
plot_tracks(s_diagrams[base_year],'Swinestrain_compared_against_year_%d.png'%base_year)
When comparing the same year sequence of H1N1 between the two hosts, we find that the genomes from 1935 have 1349 differences, in 1978, the differences are 2004, and in 2009, there are 1112 changes. Surprisingly in 2014, the differences between the H1N1 genome found in human and pig is only 136. This could suggest that the 2014 H1N1 found in humans and pigs is more closely related than the strains found in 1935, 1978, or 2009.
In [ ]:
plot_tracks(diagram, 'host_compare.png')
Plotting the differences in genome over time we can see that H1N1 changed faster in pig between 1935 and 1978, and then we see a decrease in the rate between 1978 and 2004. Finally we see an increase in the rate of change between 2009 and 2014. In humans we see a different pattern, the highest rate of change in the H1N1 genome in a human host happened between 1978 and 2009, with a slow rate between 1935 and 1978, and the slowest rate of change happening between 2009 and 2014 when only 100 changes occurred.
In [ ]:
years = [1935,1978,2009,2014]
diffs = {}
for i in range(len(years)):
for ii in range(len(years)):
diffs['h_%i_%i'% (years[i],years[ii])] = get_diff_locations(all_aln[i].seq,all_aln[ii].seq)
diffs['s_%i_%i'% (years[i],years[ii])] = get_diff_locations(all_aln[i+4].seq,all_aln[ii+4].seq)
diffs['h_%i_s_%i'% (years[i],years[ii])] = get_diff_locations(all_aln[i].seq,all_aln[ii+4].seq)
In [ ]:
#Plot differences vs. 1935 over time
years = [1935,1978,2009,2014]
#Lists of human and siwne diff lengths vs. 1935
hdl = [len(diffs['h_1935_%i'%years[i]]) for i in range(len(years))]
sdl = [len(diffs['s_1935_%i'%years[i]]) for i in range(len(years))]
fig_diffs_over_time = plt.figure(figsize=(6,4), dpi=600);
ax_diffs_over_time = fig_diffs_over_time.add_axes([0.025,0.025,0.95,0.95])
hline = plt.plot(years,hdl, 'bo-', label='Human Viruses')
sline = plt.plot(years,sdl, 'g^-', label='Swine Viruses')
ax_diffs_over_time.set_xlabel('Year')
ax_diffs_over_time.set_ylabel('Differences vs. 1935')
plt.legend(loc='best');
In [ ]:
#Plot differences vs. 1935 as a percentage of total genome length over time
years = [1935,1978,2009,2014]
#Lists of human and siwne diff lengths vs. 1935 as a percentage of total genome length
hdl_percent = [i*100./13378. for i in hdl]
sdl_percent = [i*100./13378. for i in sdl]
fig_diffs_over_time2 = plt.figure(figsize=(6,4), dpi=600);
ax_diffs_over_time2 = fig_diffs_over_time2.add_axes([0.025,0.025,0.95,0.95])
hline2 = plt.plot(years,hdl_percent, 'bo-', label='Human Viruses')
sline2 = plt.plot(years,sdl_percent, 'g^-', label='Swine Viruses')
ax_diffs_over_time2.set_xlabel('Year')
ax_diffs_over_time2.set_ylabel('% Change Since 1935')
plt.legend(loc='best');
The phylogenetic trees for the aligned strains provide another way to quantify differences between strains. Phylogenetic Trees were created by Clustal during the alignment process. The distance values between nodes calculated by Clustal during alignment equal the 'number of substitutions as a proportion of the length of the alignment (excluding gaps)' (http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny/help/faq.html). This distance value provides another metric by which we can compare strains. In the phylogenetic trees below, strains are labels by the first letter of the host species, followed by the year.
In [ ]:
tree = Phylo.read('all_seq.dnd',"newick")
human_tree = Phylo.read('all_human.dnd',"newick")
swine_tree = Phylo.read('all_swine.dnd',"newick")
Phylo.draw(tree)
Phylo.draw(human_tree)
Phylo.draw(swine_tree)
In [ ]:
import itertools
import numpy as np
def terminal_dists(self):
"""Return a list of distances between all terminals."""
def generate_pairs(self):
named_clades=[i for i in self.find_clades(terminal=True)]
s = itertools.combinations(named_clades,2)
return list(s)
return {(i[0].name,i[1].name): self.distance(*i) for i in generate_pairs(self)}
def compare_terminals_to_base_element(tree, base_element_name):
"""Return a list of distances between all terminals and specified base element"""
base_elem = tree.find_elements(name=base_element_name)
be = next(base_elem)
terminals = tree.get_terminals()
y_list,d_list = [],[]
for i in terminals:
if i != be:
y_list.append(i.name)
d_list.append(tree.distance(be,i))
return y_list, d_list
sd = terminal_dists(swine_tree)
hd = terminal_dists(human_tree)
yh,dh = compare_terminals_to_base_element(human_tree,'h1935')
ys,ds = compare_terminals_to_base_element(swine_tree,'s1935')
We compare the inter-year distances (multiplied by 100 for visualization purposes) between terminals of a single-host phylogeny tree from a base year (1935) to every other year. We can see the same trend.
In [ ]:
#data for barchart
data = zip(np.array(dh)*100,np.array(ds)*100) #note that the distances have been multiplied by 100
std = [[0 for ii in i]for i in data]
group_labels = ['human','swine']
years = [i[1:] for i in ys]
#make barchart
multigroup_barchart.bar_chart(data,std,years,group_labels)
plt.legend(loc=2)
plt.title('Group by host')
plt.show()
In [ ]:
#data for barchart
data = [np.array(dh)*100,np.array(ds)*100]
std = [np.zeros(len(i)) for i in data]
group_labels = ['human','swine']
years = [i[1:] for i in ys]
#make barchart
multigroup_barchart.bar_chart(data,std,group_labels,years)
plt.legend(loc=2)
plt.title('Group by year')
plt.show()
In [ ]:
import pandas as pd
def df_from_2_key_dict(twokey_dict):
'''
create a data frame showing distances between all the years.
This dataframe can be passed to the heatmap function
'''
temp ={}
for k,v in twokey_dict.iteritems():
#v is row value, k is row label
k1 = k[0]
k2 = k[1]
try:
#append to current sub-dict
temp[k1][k2] = v
except:
#start new sub-dict
temp[k1] = {}
temp[k1][k2] = v
return pd.DataFrame.from_dict(temp)
#heatmap annotation function (annotates each cell in heatmap)
mod_annotate = lambda x: heatmap.show_values(x,fmt="%.5f")
In the heatmaps below, zeroed cells indicate where no comparisons were made or comparisons would be redundant. Both heatmaps have the same color scale and can be directly compared.
In [ ]:
#Swine
compare = df_from_2_key_dict(sd).fillna(0)
heatmap.heatmap_from_dataframe(compare, annotate_function=mod_annotate,vmin_vmax=(0,0.175))
plt.title('Distance between each node (year) of the Phylogenetic Tree for swine strains')
plt.show()
In [ ]:
#Human
compare = df_from_2_key_dict(hd).fillna(0)
plot,bar = heatmap.heatmap_from_dataframe(compare, annotate_function=mod_annotate,vmin_vmax=(0,0.175))
plt.title('Distance between each node (year) of the Phylogenetic Tree for human strains')
plt.show()
The rate of change of the influenza A virus is different between humans and swine, but the rate is not constant within the host. At different years the rate of change for IAV was faster in pigs, from 1935 to 1978, having 1206 changes compared to 850 changes for IAV found in humans in the same years. Nonetheless difference rates are not always higher for IAV in swine, in the year 1978 to 2009 the IAV had 2323 changes in humans compared to 1846 differences found in swine. The differences in rate of change in IAV sequences found in swine might be explained by the changes in modern pig farming. The housing of pigs from out door pens to multi-level closed infrastructure and the increase in animals farmed at any one time have contributed to the increase potential of transmitting IAV among the farmed pigs {Looker:2003}. By increasing the number of hosts available in a single IAV infection, the number of changes increases therefore increasing the rate in which the IAV sequence changes.
Goka, E. A., Vallely, P. J., Mutton, K. J. & Klapper, P. E. (2014). Mutations associated with severity of the pandemic influenza A(H1N1)pdm09 in humans: a systematic review and meta-analysis of epidemiological evidence. Arch Virol 159, 3167–3183.
Nelson, M. I. & Vincent, A. L. (2015). Reverse zoonosis of influenza to swine: new perspectives on the human–animal interface. Trends Microbiol 23, 142–153. Elsevier Ltd.
Urbaniak, K. & Markowska-Daniel, I. (2014). In vivo reassortment of influenza viruses. Acta Biochim Pol 61, 427–431.
Webster, R. G. & Govorkova, E. A. (2014). Continuing challenges in influenza. Ann N Y Acad Sci 1323, 115–139 (K. Bush, Ed.).